Server-side Content Storage

After our earlier discussions, we settled on the idea that the new storage solution for the provisioning subsystem in RHQ should be based on a third-party library or server, so that we don't have to solve the scaling of a home-grown solution ourselves.

In the end, only two possible candidates were left:

  • Modeshape, a JCR (Java Content Repository) implementation

  • S-RAMP, which is built on top of Modeshape

Both of these choices offer the possibility of integrating with the existing infrastructure that users already have in place (well, less so in the case of S-RAMP, which does not yet have a full-featured released implementation). While a simple JCR repository offers exactly what we most probably need from a content store and not much more (which is a good thing), it would be a shame not to at least try to integrate with S-RAMP, which offers integration with business rules and, more importantly from the point of view of a pure repository, a standardized way of classifying and querying for artifacts. One of the main strengths of S-RAMP is its ability to "annotate" artifacts with metadata, which can then be exploited for various other use cases: impact analysis, dependency analysis, dynamic composition of more complex artifacts (bundles from constituent parts), mapping logical constructs (for example the assignment of an artifact to an "environment" such as prod, dev, or staging) to the concrete physical layout known by RHQ, and so on. The possibilities for more intelligent artifact manipulation are quite vast and it would be a shame to throw all of that out.

On the other hand, we need to take the deployment of an even more complex RHQ setup into consideration. Today, we have the RHQ server, the DB, and a Cassandra cluster. Introducing either Modeshape or S-RAMP (which is based on Modeshape) adds another item to this list: a cluster of Infinispan nodes.

Repositories

The current storage of content in RHQ has a notion of a "content source" and a "repository". A content source is a pluggable way of connecting to a third-party content store and pulling data out of it. Such content can then be "pushed" into one or more repositories. Internally, the content sources just "pour" everything into one big pool of "packages", and repositories are merely views showing a subset of the whole pool.

Given that we could not decide on one storage solution that would be used exclusively in RHQ, we need to keep some kind of pluggability in the new system too, enabling us to connect to either S-RAMP or a JCR (or something else, if we so decide in the future).

New Model

Referencing Content

In the new model, no data about the content or the metadata available in the repositories will be stored in the database. (This might not be entirely true for authz: we may need to store the content group assignment somewhere, either supported by the content store itself, i.e. a custom property in JCR or S-RAMP, or in a DB table.) There will of course be objects that represent the content and its metadata, but those will be DTOs that accompany the content on its path from the repository to the agent.

In the histories, content will be addressed by a URI that simply describes its location.

The Role of Repositories

Unlike today, repositories will serve the role of "views" into a remote repository (either JCR or S-RAMP). In essence, a repository will become what is today understood as a content source.

The repositories will thus be able to connect to a remote content store and view the "files" in it. Because we want to be able to address a JCR, the files in a repository need to be hierarchical.
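As a minimal sketch, walking such a hierarchical view against a plain JCR store could look like this. It uses only the standard javax.jcr API; how the Repository instance is obtained from the connection properties (JNDI, a ModeShape factory, ...) is vendor-specific and omitted here.

```java
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Repository;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

// Sketch: list the file hierarchy visible through a JCR-backed repository.
public class JcrRepositoryView {

    public static void listFiles(Repository repository, String user, char[] password)
            throws RepositoryException {
        Session session = repository.login(new SimpleCredentials(user, password));
        try {
            walk(session.getRootNode());
        } finally {
            session.logout();
        }
    }

    private static void walk(Node node) throws RepositoryException {
        for (NodeIterator it = node.getNodes(); it.hasNext();) {
            Node child = it.nextNode();
            if (child.isNodeType("nt:file")) {
                System.out.println(child.getPath()); // e.g. /bundles/myapp/deploy.xml
            } else {
                walk(child); // nt:folder and other container nodes
            }
        }
    }
}
```

The walk treats nt:file nodes as the leaves of the view, which matches the hierarchical "files" layout the repository needs to expose.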

A repository will be identified by a URI, with the possibility to provide additional configuration necessary to connect to it. E.g. jcr://localhost:1234, where the configuration contains the credentials needed to connect to the repository.

The URI is repository-specific; in the case of S-RAMP, one might envision the URI also containing a search query to limit the artifacts showing up in the repo, e.g.: s-ramp://localhost:8080/s-ramp-server/s-ramp/ext/WarDocument[ @classifiedByAnyOf(., 'testsPassed')]. This might prove problematic, though, because we could then have trouble figuring out which repository a file belongs to. A better approach might be to consider the repository URI the absolute location of the repository and introduce a concept of "saved searches". A file would then be located in the repository but could be visible through multiple saved searches.
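A minimal sketch of what a repository definition under the "saved searches" approach might look like; all of these names are hypothetical and exist nowhere in RHQ today:

```java
import java.net.URI;
import java.util.Map;

// Hypothetical shape of a repository definition with named saved searches.
class RepositoryDefinition {
    URI uri;                                  // absolute location, e.g. s-ramp://localhost:8080/s-ramp-server
    Map<String, String> connectionProperties; // e.g. the credentials needed to connect
    Map<String, String> savedSearches;        // name -> query, e.g. "testsPassed" ->
                                              // "/s-ramp/ext/WarDocument[@classifiedByAnyOf(., 'testsPassed')]"
}
```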

What Describes a Piece of Content?

A pair of URIs: the URI of the repository the content comes from, plus the URI of the piece of content describing its location within the repository. Given that the content URI would always be relative, we could actually trim the pair down to a single URI.

The only other piece of information needed apart from the URI is the mimetype. This can either be provided by the repository or be guessed from the file extension, which is deduced from the URI.
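A small illustrative sketch of this addressing scheme; the ContentRef type is hypothetical, while the URI resolution and the stock extension-based mimetype guess are standard Java:

```java
import java.net.URI;
import java.net.URLConnection;

// Hypothetical sketch of the "pair of URIs" addressing described above.
public final class ContentRef {
    private final URI repositoryUri; // e.g. jcr://localhost:1234/
    private final URI contentPath;   // always relative, e.g. bundles/myapp-1.2.war

    public ContentRef(URI repositoryUri, URI contentPath) {
        this.repositoryUri = repositoryUri;
        this.contentPath = contentPath;
    }

    /** The pair trimmed down to a single absolute URI. */
    public URI toSingleUri() {
        return repositoryUri.resolve(contentPath);
    }

    /** Mimetype guessed from the file extension when the repository provides none. */
    public String guessMimeType() {
        String guessed = URLConnection.guessContentTypeFromName(contentPath.getPath());
        return guessed != null ? guessed : "application/octet-stream";
    }
}
```

For example, resolving the relative URI bundles/app.war against jcr://localhost:1234/ yields the single URI jcr://localhost:1234/bundles/app.war.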

Internal Repositories

Apart from being able to connect to remote repositories, RHQ needs to have the ability to store the contents on its own in some embedded storage.

This is mainly because of two use cases that we support today:

  1. We have a concept of "private" repositories where users can upload scripts they want to use as alert scripts (and I also have secret plans to add more widespread scripting to other parts of the server-side functionality).

  2. More importantly though, I think it would be beneficial for users to be able to start putting today's bundles into repositories, too. Today, bundles live in one big pool, but as people use them more and more, I think it would make sense to be able to split this pool into more granular "pools".

For both of those use cases, we need to provide storage out of the box.

The main reason why we should abandon the current DB-bound storage is the difficulty we have scaling that approach out. Modeshape (and therefore S-RAMP, too) is much more versatile in this regard, at the expense of more complex configuration. At the same time, we are going ahead with the concept of "storage nodes" for metric data, which we could theoretically enhance to store content, too. The two types of storage might not share the backend and might even be separate processes, but I think it would make sense for the "storage nodes" for both types of data to be co-located.

Apart from the choice of technology to realize the internal content storage, one of the most important problems we need to solve is the difference in authorization between remote and internal repositories.

Authorization Considerations In Repositories

TODO - THIS NEEDS TO CHANGE - we need SSO to make the UIs cooperate, so why not also use SSO when accessing the repos? What are the consequences for deployment logic?

The remote repositories will be accessed using the provided "connection properties". This might mean that they are read-only to RHQ, or that putting any content into them will be done using those credentials. There will be no concept of SSO that would restrict one RHQ user from uploading content to a repository while allowing another RHQ user to do so (this might be accomplished by permissions on the RHQ side, but no information about the current RHQ user will be passed to the remote repository).

For RHQ 4.8 we're preparing new fine-grained authorization of bundles in the current bundles/content system, designed with the future retrofitting into the new provisioning subsystem in mind. The good thing about that approach is that it will fit the new system without the need to add any new permissions. The only thing that will need to change is the association of repositories with roles. Such an association will make it possible for a user to create/delete bundles/packages/files only in the repositories associated with one or more of the user's roles. The fine-grained bundle permissions already include "CREATE_BUNDLES" and "DELETE_BUNDLES", but those are applied globally. In the new system these permissions would be tied to specific repositories, as illustrated by the sketch below.
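A hedged sketch of what the repository-scoped check could amount to; the Role and Permission types and the role-to-repository association shown here are assumptions, not the actual RHQ 4.8 API:

```java
import java.util.Set;

// Hypothetical repository-scoped permission check.
enum Permission { CREATE_BUNDLES, DELETE_BUNDLES }

class Role {
    Set<Permission> permissions;
    Set<String> repositoryIds; // repositories associated with this role
}

class RepoAuthz {
    /** A user may act on a repository only through a role that both grants
     *  the permission and is associated with that repository. */
    static boolean isAllowed(Set<Role> userRoles, Permission perm, String repositoryId) {
        for (Role role : userRoles) {
            if (role.permissions.contains(perm) && role.repositoryIds.contains(repositoryId)) {
                return true;
            }
        }
        return false;
    }
}
```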

Conversion of the Current Repository Authz to the New Model

Today, the authz on repositories is quite simple: any user can create any number of repositories. A repository has an owner and can be either private or public. Private repositories are only accessible by their owners; public repositories are globally readable, but only the owner has update/delete rights. There also exists a permission called "MANAGE_REPOSITORIES" that implicitly grants all rights on repositories, including the ability to change the owner of a repo.

This is going to change for the new provisioning subsystem:

  • Users will no longer be able to create repositories; only the repository admin will be able to do that. The reason is that repositories will now be remotable and possibly much more content will be stored in them. By allowing only certain users to create repositories, we hand them the responsibility for keeping track of the number and purpose of the repos.

  • It will no longer be possible to have private repositories. The accessibility of content is governed by the content groups and thus the private flag on a repo no longer makes much sense.

  • Writability of repositories will be governed by their relationship to roles having CREATE_BUNDLES and DELETE_BUNDLES permissions.

Enhancements to Bundle Groups

The proposal for bundle permissions as it stands right now creates an analogue of resource groups in the bundle subsystem. With the introduction of scalable content storage and the ability to connect to remote repositories, I think it will be important to further deepen that analogy and come up with the equivalent of dyna-groups for the provisioning system. This will greatly ease the role of administrators, who would otherwise need to manually assign each and every bundle/file from every repository to a set of groups.

Integration with S-RAMP

On the level of repository access, there is not much we can use to integrate more tightly or meaningfully with S-RAMP. We will provide integration points for S-RAMP to ask RHQ to do things (e.g. ask RHQ to deploy something). Going the other way, i.e. sending some kind of message from RHQ to S-RAMP, is possible too, but due to the very open-ended architecture of S-RAMP I can't imagine a precanned set of things we'd be providing. Rather, all I can envision is, for example, a set of CLI scripts that could be used in alert notifications and adapted for a given S-RAMP installation (there can be different classifications available depending on the business workflow attached to artifacts, etc.).

One thing we could do to optimize communication with an S-RAMP repository: if an S-RAMP repository is present as a remote repository in RHQ, the S-RAMP server could send just links to files rather than sending the files over to RHQ, which would store them temporarily somewhere and do its magic with them. If only a link were sent, we could theoretically save on the disk space needed to process the deployment, but I consider that a marginal feature, not really important for the integration to work.

Of course, one of the nice things we could do, as already mentioned above, is filter down the artifacts made accessible to RHQ by specifying an S-RAMP query directly in the "connection details" of the S-RAMP repository.

Metadata usage

S-RAMP provides three kinds of metadata on artifacts:

  • properties

  • relationships

  • classifications

Properties and relationships are either defined by "models" (plugins in S-RAMP, so to speak, that "understand" a certain type of artifact) or defined ad hoc by clients. Classifications are supplied in the form of OWL definitions, and clients can use them to divide artifacts into hierarchical groupings.
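Illustratively, the three kinds of metadata could be pictured on a single artifact like this (a purely hypothetical DTO; the actual data model is defined by the S-RAMP specification):

```java
import java.util.List;
import java.util.Map;

// Hypothetical view of the metadata an S-RAMP artifact carries.
class ArtifactMetadata {
    Map<String, String> properties;          // name -> value; model-defined or ad hoc
    Map<String, List<String>> relationships; // relationship type -> UUIDs of target artifacts
    List<String> classifications;            // URIs of OWL classes the artifact belongs to
}
```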

RHQ Artifacts and Models in S-RAMP

The only "interesting" artifact specific to RHQ is the bundle. We would provide an RHQ model and artifact types to S-RAMP that would be able to analyze zip files and provide additional metadata if a zip file is a bundle (see the sketch after the list below). The metadata would include:

  • a derived artifact called recipe with:

    • bundle name and version as properties

  • we'd discover the other artifacts in the bundle zip (by merely unzipping it)

  • a relationship (isDeployedBy) from the recipe to the artifacts in the bundle, based on the <rhq:deploymentUnit> definitions in the bundle recipe.
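A rough sketch of the analysis such a model would perform, assuming an Ant-based recipe named deploy.xml that carries a conventionally prefixed <rhq:bundle name="..." version="..."> element (both assumptions, hedged in the comments):

```java
import java.io.File;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch: detect an RHQ bundle inside a zip and pull out recipe metadata.
public class BundleSniffer {

    public static void describe(File bundleZip) throws Exception {
        try (ZipFile zip = new ZipFile(bundleZip)) {
            // assumption: the recipe is an Ant script named deploy.xml at the zip root
            ZipEntry recipe = zip.getEntry("deploy.xml");
            if (recipe == null) {
                return; // a plain zip, not an RHQ bundle
            }
            Document doc;
            try (InputStream in = zip.getInputStream(recipe)) {
                doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            }
            // the derived "recipe" artifact: bundle name and version as properties
            // (assumes the conventional rhq: prefix on the bundle element)
            Element bundle = (Element) doc.getElementsByTagName("rhq:bundle").item(0);
            if (bundle != null) {
                System.out.println("name=" + bundle.getAttribute("name")
                        + " version=" + bundle.getAttribute("version"));
            }
            // the other artifacts in the bundle, discovered by merely listing the zip
            for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements();) {
                ZipEntry entry = e.nextElement();
                if (!entry.isDirectory() && !entry.getName().equals("deploy.xml")) {
                    System.out.println("artifact: " + entry.getName());
                }
            }
        }
    }
}
```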

Mappings

There is quite a large number of use cases that we currently lack support for that could be modelled by S-RAMP's properties and relationships (like detecting whether a deployment requires other deployments as prerequisites, etc.). All we need is to come up with them, make them available in S-RAMP, and implement the logic in the RHQ server. The flexibility of the S-RAMP data model gives us quite a bit of freedom in how and when these should be introduced.

Another part of the equation is to teach RHQ to understand and use user-defined classifications. Classifications, together with jBPM workflows, are the main types of "things" that would be supplied to S-RAMP to implement user-defined functionality. An example of a workflow would be the development lifecycle of a webapp, as shown in the S-RAMP demo materials. An example of a classification would be the individual phases of that development lifecycle. The workflow would "move" artifacts from phase to phase and change the classifiers of the artifacts accordingly.

The "meaning" of the individual classifiers is only known to the clients. From RHQ's point of view, all we can do is "assign" various types of RHQ entities to such classifiers. For example, a user could say that the classifier "Staging" corresponds to a certain resource group (for certain types of artifacts). This would provide an RHQ-specific (and user-defined) "meaning" of the classifier. Such a mapping could then be used to re-deploy artifacts to different servers. We would probably need to specify not only that a "Staging" classifier represents a resource group, but also what actions are expected when an artifact becomes classified by such a classifier and what is expected to happen once an artifact is no longer classified (see the sketch below). While this does look like quite a lot of work for the RHQ admin, the work would be done once and would then work for any workflow in Overlord/S-RAMP. Alternatively, the business logic can be executed directly from the workflows using our REST API.
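Purely as an illustration of the kind of mapping the admin would configure once; none of these types exist in RHQ, and the action names are made up:

```java
import java.util.List;

// Hypothetical mapping of a user-defined S-RAMP classifier to RHQ entities
// and the actions expected on classification changes.
class ClassifierMapping {
    String classifierUri;        // e.g. the OWL class behind the "Staging" classifier
    int resourceGroupId;         // the RHQ resource group it corresponds to
    List<String> onClassified;   // e.g. ["deployToGroup"] when an artifact gains the classifier
    List<String> onDeclassified; // e.g. ["undeployFromGroup"] when it loses it
}
```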

Migration

Content Repositories

  • MANAGE_REPOSITORIES will remain the same

  • The private flag on repositories will be removed.

  • For each user, a new role will be created with the CREATE_BUNDLES and DELETE_BUNDLES permissions, and all of the user's repositories will be assigned to it. This will give users the same control over their private and public repos as they used to have.

  • A new dynamic content group will be created that matches all the content in all public repositories. All users will become members of a new role to which this content group is assigned. This will give everyone the same access to the public repositories as they used to have.

Content Sources

It is theoretically possible to create a compatibility layer that would transfer our current content sources into the new model as new remote repositories. I consider that a low priority, though.

Bundles

  • Bundles will be converted from the current storage model to simple files in the new provisioning model.

  • All the bundles will be moved to a new repository called "Bundles".

  • The bundle groups will remain the same.
